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Abstract 


Within the last 15 years, two related coronaviruses (Severe Acute Respiratory Syndrome [SARS]-CoV and Middle East Respiratory 
Syndrome [MERS]-CoV) expanded their host range to include humans, with increased virulence in their new host. Coronaviruses were 
recently found to have little intrinsic disorder compared with many other virus families. Because intrinsically disordered regions have 
been proposed to be important for rewiring interactions between virus and host, we investigated the conservation of intrinsic disorder 
and secondary structure In coronaviruses in an evolutionary context. We found that regions of intrinsic disorder are rarely conserved 
among different coronavirus protein families, with the primary exception of the nucleocapsid. Also, secondary structure predictions 
are only conserved across 50-80% of sites for most protein families, with the implication that 20-50% of sites do not have conserved 
secondary structure prediction. Furthermore, nonconserved structure sites are significantly less constrained in sequence divergence 
than either sites conserved in the secondary structure or sites conserved in loop. Avoiding regions symptomatic of conformational 
flexibility such as disordered sites and sites with nonconserved secondary structure to identify potential broad-specificity antiviral 
targets, only one sequence motif (five residues or longer) remains from the > 10,000 starting sites across all coronaviruses in this study. 
The identified sequence motif is found within the nonstructural protein (NSP) 12 and constitutes an antiviral target potentially effective 
against the present day and future coronaviruses. On shorter evolutionary timescales, the SARS and MERS clades have more sequence 
motifs fulfilling the criteria applied. Interestingly, many motifs map to NSP12 making this a prime target for coronavirus antivirals. 
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Introduction 


Severe Acute Respiratory Syndrome (SARS)-CoV and Middle 
East Respiratory Syndrome (MERS)-CoV are two closely related 
zoonotic coronaviruses. Both have successfully crossed the 
species barrier to allow animal-to-human transmission, and 
further to allow human-to-human transmission (Song et al. 
2005; Reusken et al. 2016). The SARS outbreak in 2003 had a 
mortality rate of 10% (Anderson et al. 2010), and SARS-CoV 
was considered the most aggressive coronavirus compared to 
other human coronaviruses that commonly cause mild to 
moderate infection in their hosts (van der Hoek 2007). 
MERS-CoV Is the cause of an ongoing outbreak of the respi- 
ratory illness MERS (de Groot et al. 2013). At the time of 
writing, 1791 MERS cases have been confirmed with a mor- 
tality rate of approximately 35% (World Health Organization 


2016). Both MERS and SARS have higher mortality rates in 
elderly and immunosuppressed populations (Gralinski and 
Baric 2015). 

The host changes by MERS-CoV and SARS-CoV suggest 
that other coronaviruses can potentially cross the species 
barrier, become zoonotic, and enable human-to-human 
transmission, ultimately causing high morbidity and mor- 
tality. SARS-CoV and MERS-CoV exploited mechanistically 
different approaches to overcome the human species bar- 
rier, but these two viruses have a lot in common (Lu et al. 
2015). Here, we aim to identity the vulnerable regions in 
the proteomes of coronaviruses that neither SARS-CoV 
nor MERS-CoV nor their contemporary and forthcoming 
relatives can proliferate without, and address how to 
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mobilize a defense against the present and future corona- 
viruses by targeting these regions. 

SARS-CoV and MERS-CoV are positive (+)-strand RNA vi- 
ruses encoding approximately 25 protein products. The MERS- 
CoV proteome is primarily composed of two polyproteins, 
ORF1a and ORF1ab; the latter is generated by a -1 ribosomal 
slippage frameshift. These proteins are cleaved into 16 
nonstructural proteins (NSPs). NSPs 1-10 are products of 
both polyproteins, whereas NSPs 12-16 are only yielded by 
ORF1ab. NSP11 is unique to ORF1a (van Boheemen et al. 
2012). Structural proteins envelope (E), spike (S), membrane 
(M), and nucleocapsid (N) are elements of the physical struc- 
ture that encloses the viral ganome and come trom distinct 
reading frames, unlike ORF1a and ORF1ab, which come from 
overlapping reading frames. Additionally, the structural pro- 
teins are the product of subgenomic mRNAs that are joined 
during discontinuous negative RNA strand synthesis (van 
Boheemen et al. 2012). Finally, NS3 protein (NS3), NS4A pro- 
tein (NS4A), NS4B protein (NS4B), NS5 protein (NS5), and 
Orf8b protein encompass the remainder of the proteome 
and also arise trom distinct reading frames (van Boheemen 
et al. 2012). 

Our approach utilizes genomic sequence data, which Is 
readily available for viruses known to cause disease. 
However, because most viruses pose no major threat to 
their host, they pass by unnoticed leaving the majority of 
virus genome space uncharted. With the availability of cost- 
efficient genome sequencing technology, and recent develop- 
ments in the field of viral metagenomics, large-scale identifi- 
cation of viral genome space is on the rise (Rosario and 
Breitbart 2011; Mokili et al. 2012). By exploring viral diversity, 
critical components constituting a viral genus’ fitness can be 
evaluated. Examples such as the common influenza virus illus- 
trate the rapidity of viral gene mutation and in order to main- 
tain immune protection, an annual flu vaccination Is 
recommended. Underway efforts aim to generate broadly 
neutralizing vaccines whose design accounts for the genomic 
sequences of multiple types of influenza virus to eliminate 
frequent re-vaccination against the flu (Giles and Ross 2011, 
2012). Development of broadly neutralizing vaccines often 
relies on the consensus or ancestral sequences of extant viral 
sequences in order to provide greater coverage for related 
viruses (Kesturu et al. 2006). Unfortunately, consensus se- 
quences can be misleading, and ancestral sequence recon- 
struction is error-prone for quickly diverging sequences 
(McCloskey et al. 2014). In addition, viruses with compact 
genomes often express proteins with structural disorder that 
may undergo structural transformations. Although these 
transtormer proteins, like VP40 in Ebola, are masters at chang- 
ing their structure, and thus expanding their functional reper- 
toire as needed for the life cycle of the virus (Bornholdt et al. 
2013), flexible regions are potentially important in rewiring 
protein-protein interactions between the virus and its host 
(Le Breton et al. 2011; Ortiz et al. 2013; Gitlin et al. 2014). 
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The flexibility trait of many viral proteins is a complicating 
factor In vaccine development. For instance, Dengue virus ex- 
hibits serotype-specitic antibody affinity that causes antibody- 
dependent enhancement, an obstacle in the development of 
Dengue vaccines that protects against all four serotypes (Flipse 
and Smit 2015). To overcome the hurdle posed by structural 
flexibility, we propose an additional screening step in identi- 
tying potential vaccine or antiviral targets that considers the 
structural flexibility of the viral proteins. The Structural 
Genomics Initiatives increased their success rate by excluding 
proteins predicted to be structurally disordered (Slabinski et al. 
2007). A similar approach can perhaps benefit vaccine devel- 
opment. Furthermore, to make this approach robust to po- 
tential mutations, minimizing loss in efficacy or resistance, the 
evolutionary context of sequence and structure must be con- 
sidered. Thus, we suggest expanding the concept of broadly 
neutralizing vaccines/antivirals by increasing the diversity of 
viruses considered If possible. Sites conserved for sequence, 
structure, and with low disorder propensity among diverse 
virus protein homologs are very likely to be constrained trom 
1) changing sequence on evolutionary time scales and 2) un- 
dergoing real-time structural transitions. These sites have po- 
tential as targets for broad-specificity antivirals or vaccines 
because conservation makes them broad-specificity and low 
dynamics avoids targeting a conformational ensemble, which 
is not only difficult (Yu et al. 2016), but that may change as 
the sequence diverges (Siltberg-Liberles et al. 2011). 

A recent large-scale study of structural disorder in >2,000 
viral genomes in 41 viral families found the amount of disorder 
in different virus families varying from 2.9% to 23.1% 
(Pushker et al. 2013). It was reported that Coronaviridae has 
very low disorder content (mean disorder 3.68%) (Pushker 
et al. 2013). Coronaviridae contains two subfamilies: 
Coronavirinae and Torovirinae. SARS-CoV and MERS-CoV 
are part the Coronavirinae subfamily, from here on referred 
to as coronavirus (CoV). The lack of disorder Is intriguing be- 
cause It may be important for rewiring interactions between 
viral proteins and host proteins (Ortiz et al. 2013) and provid- 
iNg Opportunities to acquire novel functional sequence motifs 
(Gitlin et al. 2014). Structural disorder has also been proposed 
to be important for viral viability, enabling multifunctionality 
and vigor in response to changes in the environment (Xue 
et al. 2014). Given the low fraction of structural disorder re- 
ported across Coronaviridae, we set out to investigate the 
conservation of structural disorder and secondary structure 
across CoV. Sites identified as conserved for structure and 
lacking disorder can be considered to be vulnerable and drug- 
gable in the proteomes of coronaviruses. The structural diver- 
gence capacity of these regions is limited, leaving a wider 
range of the present and emergent coronaviruses susceptible 
to the effects of potential broadly neutralizing anti-CoV ther- 
apies targeting these sites. We will refer to these sites as target 
sites. 
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Materials and Methods 


Protein Family Reconstruction 


Protein sequences were identified by individual BLAST 
searches with MERS-CoV (Taxonomy ID: 1335626) proteins 
ORFlab (YP_009047202.1; polyprotein), S protein 
(YP_009047204.1), M protein (YP_009047210.1), E protein 
(YP_009047209.1), and N protein (YP_009047211.1) against 
coronaviruses. BLAST searches of the ORF1ab protein were per- 
tormed, using start and end positions as detailed in the ORF1ab 
NCBI Reference Sequence file, against the refseq_protein data- 
base. The sequences retrieved from the BLAST output main- 
tained the following cutoff: >30% sequence identity 
and >50% coverage relative to MERS-CoV sequence query. 
The 30% sequence identity and 50% query coverage cutoff 
strikes a balance between alignment quality and at least 10 
sequences for most protein families. NSP1 (YP_009047202.1; 
1-193), NSP2 (YP_009047202.1; 194-853),  NS3 
(YP_009047205.1), NS4A (YP_009047206.1), NS4B 
(YP_009047207.1), NS5 (YP_009047208.1), ORF8b protein 
(YP_009047212.1), and NSP11 (YP_009047203.1; 4378- 
4391) are not included in this study due to <10 BLAST hits. 

Multiple sequence alignments were constructed for the se- 
lected BLAST hits using MAFFT (Katoh et al. 2002). 
Phylogenetic trees were constructed using MrBayes 3.2.2 
with a four category gamma distribution and the mixed 
model for amino acid substitution (Huelsenbeck and 
Ronquist 2001; Ronquist and Huelsenbeck 2003). Each tree 
ran for tive million generations, with a sample frequency of 
100. The final tree was constructed from the last 75% of 
samples, discarding the first 25% of samples as the default 
burnin, and using the half-compatible parameter, to avoid 
weakly supported nodes (i.e., with a posterior probability 
<0.5). All trees were midpoint rooted. 

For every protein family, the amino acid substitution rate 
per site in its multiple sequence alignment was calculated 
using empirical Bayesian estimation as implemented in 
Rate4Site (Mayrose et al. 2004). Substitution rates were cal- 
culated using 16 gamma categories, the JTT substitution 
matrix Jones et al. 1992), and the reconstructed phylogenies. 
The rates were normalized per protein family with an average 
across all sites equal to zero and SD equal to 1. This means that 
sites with a rate <0 are evolving slower than average, whereas 
sites with a rate >0 are evolving faster than average. 


Prediction of Intrinsic Disorder Propensity and Secondary 
Structure 


Intrinsic disorder propensity was inferred using two different 
predictors: IUPred (default settings; “long” option) (Dosztany! 
et al. 2005a, 2005b) and DISOPRED2 (Ward et al. 2004) for all 
proteins. For lUPred, the site-specific continuous disorder pro- 
pensities for each protein were mapped onto their corre- 
sponding position in the multiple sequence alignment as 
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raw disorder propensities and as binary states, order or disor- 
der, using two cutoffs of 0.4 and 0.5. Disorder propensities 
below the cutoff were assigned order and disorder propensi- 
ties at the cutoff or above were assigned disorder. For the 
DISOPRED2 predictions that were inferred using the nr data- 
base, the continuous disorder propensities for every site in a 
protein were mapped onto their corresponding position in the 
multiple sequence alignment as raw disorder propensities and 
as binary states, order or disorder, using a cutoff of 5. 
Consequently, for every protein family (a multiple sequence 
alignment and Its corresponding phylogenetic tree), two con- 
tinuous matrices and three binary matrices resulted: IUPred 
0.4, l|UPred 0.5, and DISOPRED2. An additional matrix was 
generated to indicate sites where the binary order and disor- 
der assignments differ between IUPred 0.4 and DISOPREDZ2. 

A similar methodology was employed to analyze secondary 
structure predicted by PSIPRED (McGuffin et al. 2000) and 
JPred (Drozdetskiy et al. 2015). For both predictors, the 
uniref90 database was used and sites were classified as 
loops, alpha helices, or beta strands and mapped back onto 
their corresponding sites in the multiple sequence alignment. 
This resulted in two three-state matrices for each protein 
tamily alignment, one for each predictor, and two binary ma- 
trices displaying secondary structure elements (alpha helix and 
beta strand) or loops. An additional matrix was generated to 
indicate sites where the secondary structure assignments 
differ between PSIPRED and JPred. 

For every protein family, the binary matrices resulting from 
the different disorder predictions and from the different sec- 
ondary structure predictions were analyzed in the correspond- 
ing evolutionary context using GLOOME. GLOOME (Gain-Loss 
Mapping Engine) analyzes binary presence and absence pat- 
terns in a phylogenetic context (Cohen et al. 2010). In this 
study, the Rate4Site option in GLOOME was used to analyze 
the binary matrices (IUPred 0.4, IUPred 0.5, DISOPRED2, 
PSIPRED, and JPred) with the corresponding phylogenetic 
trees to map change of state across sites in each individual 
protein phylogeny (Cohen and Pupko 2010; Cohen et al. 
2010). GLOOME was run with 16 gamma categories and a 
substitution matrix set to equal rates within each state and 
transitions between states treated equally. From the binary dis- 
order and order matrices, transition rates between disorder and 
order or vice versa (DOT) were estimated. From the binary struc- 
ture and loop matrices, transition rates between structure and 
loop or vice versa (SLT) were estimated. Similar to Rate4Site, the 
rates were normalized per protein family with an average 
across all sites equal to zero and SD equal to 1. This means 
that sites with a rate <O are evolving slower than average, 
while sites with a rate >0 are evolving faster than average. 


Protein Family Visualization 


Protein families were visualized in an integrative manner with 
a phylogenetic tree, any matrix (multiple sequence alignment 
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or predictor based) displayed as a heatmap, and site-specific 
sequence transition rates using Python packages ETE3 
(Huerta-Cepas et al. 2016) and Matplotlib (Hunter 2007). 


Statistical Analysis of Amino Acid Evolutionary Rate 
Distributions 


Amino acid evolutionary rates (SEQ) for all sites across all align- 
ments were aggregated and binned into four possible cate- 
gories characterized by the distribution of PSIPRED predicted 
secondary structure at each site. Sites predicted to have a loop 
across all sequences are “conserved loops; C(L)” and sites 
predicted to have a helix across all sequences or a strand 
across all sequences are “conserved helix-strand; C(HS)” 
(table 3). Sites predicted to have all three states (helix, 
strand, and loop) or any combination of loop and one other 
state are “non-conserved helix, loop, strand; NC(HLS)” and 
sites predicted to have a mixture of helix and strand are “non- 
conserved helix-strand; NC(HS)” (table 3). In all cases, gaps 
were ignored when classifying combinations of secondary 
structure at a site or if secondary structure conservation 
exists at a particular site. 


Results 
Phylogenies 


Phylogenies were built for all protein products encoded in the 
MERS-CoV single-stranded RNA genome, except for NSP1, 
NSP2, NS3, NS4A, NS4B, ORF8b protein, and NSP11, all of 
which had insufficient sequence data (< 10 sequence hits with 
BLAST). 

NSP12 is often used as a measure for newly identified 
coronaviruses. According to the International Committee of 
Taxonomy of Viruses, a major criterion in determining If a 
coronavirus Is considered novel is pairwise sequence identity 
below 90% for NSP12 in all comparisons to previously known 
coronaviruses (Bermingham et al. 2012). Four main clades, 
alphacoronavirus, betacoronavirus, gammacoronavirus, and 
deltacoronavirus (fig. 1), are identified in agreement with 
the taxonomic classifications described by the ICTV 
(International Committee on Taxonomy of Viruses 2015). 
Coronaviruses not listed by the ICTV are assumed to be a 
part of the clade in which representatives with known classi- 
Tications are situated in our NSP12 phylogeny. 

The MERS clade and SARS clade are sister clades in the 
NSP12 phylogeny. The HKU1 clade and EQU clade are also 
sister clades. Together these four clades form the 
Betacoronavirus clade, in accordance with the ICTV classifica- 
tion (International Committee on Taxonomy of Viruses 2015). 
Betacoronavirus is represented in all phylogenies although the 
order of the individual subclades varies. Alphacoronavirus Is 
often found as the sister clade or outgroup to 
betacoronavirus. Deltacoronavirus or gammacoronavirus are 
the most distantly related to the betacoronavirus. In the 
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nucleocapsid phylogeny, gammacoronavirus Is the first out- 
group clade to betacoronavirus, and alphacoronavirus is the 
most distant outgroup. Most NSP trees exhibit some unre- 
solved nodes at junctures immediately preceding terminal 
nodes. As an effect of the 50% majority rule, most of the 
546 resolved nodes are well supported with posterior proba- 
bility >0.9 for 82% and >0.99 tor 68% (supplementary fig. 
S1, Supplementary Material online). Most trees follow the 
NSP12 topology for the main clades, with minor clade rear- 
rangements. It should be noted that for NSP5, the entire 
alphacoronavirus clade is placed within the betacoronavirus 
clade, as a sister clade to the MERS clade (supplementary Tig. 
S1, Supplementary Material online). This may be due to in- 
creased sequence divergence rates or due to recombination. 
Recombination events are rather frequent in coronaviruses (Su 
et al. 2016), and the MERS clade potentially underwent mul- 
tiple recombination events as part of the host change (Zhang 
et al. 2016). 

The phylogenies for membrane protein, spike protein, 
NSP5, and NSP8-NSP16 demonstrate (with the given BLAST 
cutoffs) recoverable protein homologs such that all corona- 
viruses are represented (i.e., all coronaviruses represented in 
the NSP12 phylogeny). Nucleocapsid, NSP4, and NSP7 have 
recoverable homologs in all clades except deltacoronavirus. 
NSP3 and NSP6 homologs are too divergent in 
deltacoronavirus and/or gammacoronavirus relative to 
MERS-CoV. Envelope appears specific to betacoronavirus 
(fig. 1), but it is a short protein that has been found to diverge 
rapidly and is likely present outside betacoronavirus (Fehr and 
Perlman 2015). Because different protein families yield slightly 
different phylogenies, tor the remaining evolutionary analyses, 
every protein family was analyzed in the context of its own 


phylogeny. 


Intrinsic Disorder Is Rarely Conserved 


For all protein families, structural disorder propensities were 
predicted using IUPred (Dosztanyi et al. 2005a, 2005b) and 
DISOPRED2 (Ward et al. 2004). To verify the robustness of the 
binary |UPred and DISOPRED2 predictions, the binary assign- 
ments were compared on a site-by-site basis (table 1). When 
converted to binary (i.e., two states per site disordered or or- 
dered) IUPred 0.4 and IUPred 0.5 are in good agreement with 
the larger differences seen for NSP8, NSP9, and nucleocapsid 
(7.5%, 6.5%, and 19.0%, respectively) (table 1). Comparing 
IUPred 0.4 or lUPred 0.5 to DISOPREDZ2, large differences are 
in particular seen for nucleocapsid (38.7% and 29.7% respec- 
tively) and NSP8 (23.5% and 25.9%, respectively) (table 1). 
For nucleocapsid, regions that are found to be disordered by 
IUPred 0.4 are found to be ordered by |!UPred 0.5 and 
DISOPRED2 (fig. 2 and supplementary fig. S2, 
Supplementary Material online). For NSP8, regions that are 
only slightly disordered in a few sequences according to 
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Fic. 1.—CoV representative phylogeny. (A) NSP12 is a representative for the CoV protein phylogenies, colored by clade (alophacoronavirus or FIPV, green; 
betacoronavirus has four different subclades: SARS, blue; MERS, gray; HKU1, pink; EQU, purple; gammacoronavirus or SW1, yellow; deltacoronavirus or 
HKU19, white.) Posterior probability indicating node support is shown in red. (B) Protein family distribution across coronavirus based on the given cutoff 
(>30% sequence identity and >50% coverage relative to MERS-CoV sequence query). Clade color applied throughout the remaining figures. Areas shaded 
in gray with an arrow indicate that the protein family is not identified for that clade with the given cutoffs, but is found from the arrow tip. See supplementary 


fig. S1, Supplementary Material online, for the remaining phylogenetic trees. 


lUPred 0.4 and |UPred 0.5, DISOPRED2 predicts disorder to be 
conserved for all sequences (Tig. 3). 

To quantify the fraction of disordered sites per protein 
family, we report the IUPred 0.4 results only for simplicity 
(table 1). In general, IUPred 0.4 predicts more disorder than 
DISOPRED2, but several protein families have almost no dis- 
ordered sites. NSP3 and NSP8-10 have some variation in dis- 
order content for different viruses. Based on the fraction of 
disorder, nucleocapsid is the only highly disordered protein 
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among the CoVs in this study, even if NSPs 8-10 have outliers 
that are >20% disordered. 

To compare the disorder-to-order transition rates (DOT) 
tor all protein families where the binary matrices of disorder 
and order include both states, the quadrant count ratio 
(QCR) was estimated as a measure of association in assign- 
ing slower than average vs. taster than average transition 
rates. For IUPred 0.4 vs. IUPred 0.5, for IUPred 0.5 vs 
DISOPRED2, and for |UPred 0.4 vs. DISOPRED2, the QCRs 
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Table 1 
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Protein Family Wide Disagreement of Disorder and Secondary Structure Predictions 


Predictor l|UPred 0.4 I|UPred 0.4 

VS. VS. 
Protein lUPred0.5 DISOPRED2 

family % % 
NSP3 Z| 4.6 
NSP4 0.1 0.9 
NSP5 0.8 3.1 
NSP6 0.0 0.8 
NSP7 0.0 5.4 
NSP8 1.5 23.0 
NSP9 65 11.0 
NSP 10 3.9 12.4 
NSP12 0.5 1.4 
NSP13 1.0 8.8 
NSP 14 05 0.8 
NSP15 0.9 20 
NSP 16 0.2 1.4 
ENVELOPE 0.3 10.0 
NUCLEOCAPSID 19.0 Siem A 
MEMBRANE 1.0 9.5 
SPIKE 0.6 2.6 
eae 1.9 5.6 


DISAGREEMENT 


lUPred 0.5 PSIPRED Disorder fraction’ 

VS. VS. lUPred 0.4 
DISOPRED2 JPred 

% % 0 02 04 06 08 

3.2 2i2 FE - 

0.8 22.6 } 

2.4 13.7 I 

0.8 18.6 } 

bo 8.1 4 

25.9 (es hh 1 

7.2 20.4 nina 

9.8 24.6 {Fh 4 

1:0 20.2 ie 

8.5 25.4 hi 

0:3 16.7 ip 

1.1 16.3 FH 

1.3 14.1 re 

10.4 29.7 } 

ZO 10.2 + Fb 

10.2 20.4 te 

Z| Val lb 

4.7 18.8 


‘Tukey boxplot constructed using the |UPred 0.4 predicated disorder fraction (number of disordered sites/total sites) per sequence per protein. Green dots represent 


outliers; red diamond are the mean and red lines are the median values. 


were 0.76, 0.69, and 0.63, respectively. This shows a 
strong positive association for site-specitic DOT for all 
methods and cutoffs, with |UPred 0.4 vs. IUPred 0.5 
being the strongest (table 2). For nucleocapsid and 
NSP8, the positive associations are weaker, suggesting 
that many sites have |UPred disorder propensity in the 
0.4 to 0.5 range and large differences between IUPred 
and DISOPRED2, in accordance with the large disagree- 
ment between the binary assignment of these predictors 
(tables 1 and 2). 


Secondary Structure Prediction and Structure-to-Loop 
Transitions 


For all protein families, secondary structure elements were 
predicted using PSIPRED (McGuffin et al. 2000) and JPred 
(Drozdetskiy et al. 2015). For most protein families, the 
disagreement between secondary structure predictors Is 


greater than for the disorder predictors (table 1). In fact, 
15 of the 17 protein families compared disagree at more 
than 10% of alignment sites, and two of these disagree at 
more than 20% of sites. To compare the binary structure- 
to-loop transitions (SLT), QCR was estimated as a measure 
of association for SLT based on the different predictors. In 
general, there is a moderate positive association between 
SLT for PSIPRED vs. SLT for JPred that is weaker than for 
the different DOT comparisons (table 2). It should be 
noted that SLT does not differentiate between alpha 
helix and beta strand, but considers both as “structure.” 
This is a correct assumption if protein structure Is con- 
served and consistently predicted, but for some protein 
families that is not the case. 

Four protein families (NSP3, NSP12, NSP13, and SPIKE) 
have more than 40% of their sites found within the 
NC(HLS) category with non-conserved helix, strand, and 
loop (two or three states present at the same site) (table 
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Fic. 2.—The evolutionary context of intrinsic disorder in nucleocapsid. The phylogenetic tree was built using the multiple sequence alignments for 
nucleocapsid. Here, the multiple sequence alignment is colored by disorder propensity (with gaps in gray): (A) IUPred 0.4, blue-to-white-to-red shows 
disorder propensity according to the scale for |UPred 0.4. (B) |UPred 0.5, blue-to-white-to-red shows disorder propensity according to the scale for IUPred 0.5. 
(C) DISOPRED2, blue-to-white-to-red shows disorder propensity according to the scale for DISOPRED2. Above the heat maps, the normalized evolutionary 
rates per site for amino acid substitution (SEQ) and the DOT for the binary transformations of A—C are shown. Heat maps visualized with the Python packages 


ETE3 (Huerta-Cepas et al. 2016) and Matplotlib (Hunter 2007). 


3). For NSP13, JPred predicts 72% of all sites to be a mixture 
of helix, strand, and loop, or any combination of loop and 
one other structural element (fig. 4). Envelope and NSP6 
have 13% and 12% of their respective sites in the NC(HS) 
category. Considering only the PSIPRED predictions, the 
NC(HS) category has 245 sites across all 17 protein families. 
That is one-tenth the size of the next smallest set which is 
C(HS) with 2275 sites. Next, C(L) has 3344 sites, and the 
largest category is NC(HLS) with 4257 sites. Comparing the 
evolutionary sequence rates for the sites in the different cat- 
egories, based on PSIPRED predictions only, reveals that sites 
in the C(HS) category are evolving at a slower rate than all 
other categories. NC(HS) is only just significantly different (P 
= 4.62E—03) from C(HS), and is not significantly different 
from NC(HLS) and C(L) (P=1.85E—02 and P=8.33E—01, re- 
spectively). However, NC(HLS) and C(L) are significantly dif- 
ferent from each other, and both are significantly different 
trom C(HS) (P=1.82E—46 and P =2.33E—21, respectively) 
(fig. 5). 


Identifying Target Sites 


For regions with five or more consecutive sites that were 
100% conserved in sequence across 1) all CoV or 2) across 
the MERS and SARS clades, the information of structural dis- 
order prediction from |IUPred and DISOPRED2 was used to 
identify all ungapped sites that were consistently predicted 
to have 100% conserved order. Next, the information of sec- 
ondary structure prediction trom PSIPRED and JPred was used 


to narrow down this list further by only including sites that are 
not changing their predicted secondary structure state for 
both predictors. Applying the aforementioned filters to the 
initial 10,000 sites resulted in one (1) region of five residues 
or more conserved across all CoV within the N-terminal 
domain of NSP12: DNQDL (table 4). Interestingly, this region 
is in the vicinity of sites found important tor nucleotidylating 
activity across the order Nidovirales (Lehmann et al. 2015). 

Considering only the sequences in the SARS and MERS 
clades, 21 sequence regions of five residues or more were 
found in seven protein families (table 4). For NSP5, NSP7, 
and NSP14, experimentally determined structures show that 
most regions are surface accessible (fig. 6). Some of the iden- 
tified target sites are known for their functional importance. 
For instance, C145 in the middle of GSCGS in NSP5 is part of 
the catalytic dyad in the NSP5 protease (Yang et al. 2003). For 
NSP12 and NSP13, which have the majority of all sites, no 
structures are available. The sites adjacent to DNQDL are 
also conserved in the SARS and MERS clades, and five addi- 
tional target sites, conserved for the SARS and MERS clades, 
are found in the C-terminal direction relative to the DNQDL 
motif (table 4). Continuing into the RNA-dependent RNA po- 
lymerase domain (RdRP) in NSP12, four additional regions of 
target sites are found, and the last three regions are found in 
the C-terminal part. Importantly, in RdRP and in the C-terminal 
part are sites that are also conserved across all CoVs in this 
study. NSP13 has four regions of target sites distributed across 
the protein. 
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Fic. 3.—The evolutionary context of intrinsic disorder in NSP8. The phylogenetic tree was built using the multiple sequence alignments for NSP8. (A) The 
multiple sequence alignment is colored by amino acid according to scale, arranged based on TOP-IDP disorder promoting propensity of the amino acids 
(Campen et al. 2008), and gray denotes gaps. (B) |UPred disorder propensity per site in the multiple sequence alignment. Blue-to-white-to-red shows disorder 
propensity according to the scale for |UPred 0.4. (C) |UPred disorder propensity per site in the multiple sequence alignment. Blue-to-white-to-red shows 
disorder propensity according to the scale for |UPred 0.5. (D) DISOPRED2 disorder propensity per site in the multiple sequence alignment. Blue-to-white-to- 
red shows disorder propensity according to the scale. Above the multiple sequence alignment, the normalized evolutionary rates per site for amino acid 
substitution (SEQ) and the DOT for the binary transformations of B—D are shown. Heat maps visualized with the Python packages ETE3 (Huerta-Cepas et al. 
2016) and Matplotlib (Hunter 2007). See supplementary figures S2 and S4, Supplementary Material online for additional graphics for every protein family. 
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Table 2 
QCR* for DOT and SLT 
Rate 
Protein family DOT DOT DOT SLT 
lUPred 0.4 vs. |UPred 0.5 IUPred 0.5 vs. DISOPRED2 IUPred 0.4 vs. DISOPRED2 PSIPRED vs. JPred 

NSP3 0.75 0.68 0.61 051 
NSP4 N/AP N/A 0.58 0.61 
NSP5 0.72 0.84 O75 0.65 
NSP6 N/A N/A N/A 0.70 
NSP7 N/A N/A 0.9 0.66 
NSP8 0.86 0.43 0.38 0.67 
NSP9 0.76 0:55 0.62 0.67 
NSP10 0.93 0.67 0.68 0.42 
NSP12 0.96 0.89 0.86 0.57 
NSP13 0.76 0.70 0.59 0.36 
NSP14 0.87 0.85 0.8 0.58 
NSP15 0.82 0.85 0.67 0.53 
NSP16 0.93 0.86 0.85 0.59 
Envelope N/A N/A 0.58 0.53 
Membrane 0.54 0.57 0.67 0.55 
Nucleocapsid 0.33 0.43 0.31 0.61 
Spike 0.81 0.62 0.49 0.55 


All 0.76 0.69 0.63 0.55 


*QCR: Quadrant Count Ratio measures the association for the same site-specific rate with different predictors or cutoffs. 
PN/A: at least one of the rates in the comparison could not be estimated due to the lack of any disordered state in the binary state matrix (supplementary fig. 55, 
Supplementary Material online). 


Discussion iN rewiring protein-protein interactions between virus and 
host (Gitlin et al. 2014). Although most CoV proteins have 
almost no intrinsic disorder, several CoV protein families have 
homologous sites that display loop in some sequences, helix in 
others and strands in some (table 3, supplementary Tig. S3, 
Supplementary Material online). These sites are not necessarily 
disordered but they may be conformationally flexible in real- 
time (with secondary structure transitions in the same se- 
quence, making them difficult to predict) or on evolutionary 
time-scales (so that different secondary structure elements ac- 
tually are present in different sequences). The C(HS) and C(L) 
sites make up approximately 50-80% of most multiple se- 
quence alignments. With the common expectation that pro- 
tein structure Is more conserved than sequence these 
numbers are surprisingly low. Neither PSIPRED nor JPred con- 
sistently predicts the same state for 20-50% of all sites in 
these multiple sequence alignments. 

The accuracy of PSIPRED and JPred’s secondary structure 
predictions are about 80% (Bryson et al. 2005; Drozdetskiy 
et al. 2015). PSIPRED has been found to rarely predict an 
alpha helix instead of a beta strand and vice versa, and most 
of the PSIPRED errors are due to secondary structure not being 
predicted (Li et al. 2014). When secondary structure is not 
conserved for the same site in a multiple sequence alignment, 
it suggests that the secondary structure prediction may be 1) 
inaccurate, 2) not predicted with high confidence, or 3) the 


We have analyzed the protein evolution of the genetic com- 
ponents that make up the MERS-CoV proteome. As previously 
established, MERS-CoV has the same genomic makeup as 
HKU4-CoV and HKU5-CoV in the MERS clade (Woo et al. 
2012). Some protein products are only found in the MERS 
clade, and these were excluded from this study due to insuf- 
ficient data. Furthermore, tor other protein products, some 
clades may not be represented in our protein families if their 
proteins were too divergent. This was an important factor in 
determining the applied BLAST hit cutoffs, as relaxing cutofts 
produced alignments with more gaps and increasing strin- 
gency reduced the representative pool. Because alignment 
quality is important due to the sensitivity of both Rate4Site 
and for phylogenetic reconstruction, the chosen cutoffs are 
suitable. We note some clade-specitic differences in recover- 
able homologs between different CoV, but many components 
are shared among them (fig. 1). 

Viral proteins often possess multifunctionality, mediated by 
a conformational change in response to environment-specitic 
factors (Xue et al. 2014). Although conformational flexibility is 
important for function, it also offers flexibility in what se- 
quence motifs are on display. If these sequences are rapidly 
diverging, different sequence motifs will be displayed, reintor- 
cing the notion that flexible regions are potentially important 
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Table 3 
Structural Conservation of Sites Per Protein Family 


Conserved Sites 
Loop 


Conserved Sites 
Helix OR Strand 


C(HS) C(L) 


GBE 


Non-conserved sites 
Helix AND Strand AND 
Loop 
NC(HLS) 


Non-conserved sites 
Helix AND Strand 
NC(HS) 


Protein | 


% sites per protein family alignment per predictor | 


Family PSIPRED JPred PSIPRED JPred PSIPRED JPred PSIPRED JPred 
NSP3 18.64 16.46 | 2854 24.45 1.88 2.88 
NSP4 29.74 | 29.00 22.49 7.06 1.67 
NSP5 34.88 | 37.04 0.31 0.93 
NSP6 | 10.59 11.21 124s O12 
NSP7 | 24.10 24.10 0.00 0.00 
NSP8 | 21.05 24.88 1.91 0.96 
NSP9 34.48 38.79 0.86 jer 
NSP10 2.05 6.16 
NSP12 24.74 25.05 | 3417 27.46 1.47 3.35 
NSP13 16.67 7.43 See 17.33 1.82 ed 
NSP14 PTE 29.49 | 1.47 1A 
NSP15 Pes | | oe 2.20 4.39 
NSP16 23.87 22.26 | 4.19 4.52 
Envelope eh | 16.67 16.67 13.33 7.78 
Membrane 35.51 26.09 24.28 9.06 1.81 
Nucleocapsid 8.78 Wve, | 0.69 0.86 
Spike 18.35 19.52 | 2885 26.15 | 0.81 tear 
ater | bo75 «i 3344 al 4257 a 245 - 
of sites 
side 0-10 10-20 >20-30 [aoeaG % 


regions are indeed metamorphic; they can transition from one 
element to another. Although (1) is difficult to address without 
experimentally determined structures for all sequences, (2) and 
(3) are not necessarily incompatible interpretations because 
low confidence secondary structure prediction could indicate 
metamorphic secondary structure regions. Metamorphic sec- 
ondary structure regions have interesting consequences for 
conformational and functional flexibility. 

It should be noted that, despite the low amount of disor- 
dered sites in most CoV proteins, several regions are not 
conserved in disorder propensity across all sequences, but 
sometimes the different predictors disagree as in the case 
of NSP8. Clade-specitic disordered regions resulting from 


indel events suggest that they are not essential to the critical 
functions of the protein, but could cause gain-and-loss of 
interactions with its hosts. However, when disorder propen- 
sity is only mildly fading for a region that is present across the 
protein family, tt may be important for the fundamental func- 
tion of the protein. The virus structural proteins that interact 
to form the virion commonly include an envelope protein, a 
membrane protein, and a capsid protein that together form 
the machinery that encases, transports, and releases the 
virus. The interactions between the structural proteins are 
often regulated by conformational changes like VP40 in 
Ebola (Bornholdt et al. 2013) and Envelope protein from 
Dengue virus (Zheng et al. 2014). Contormational changes 
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Fic. 4.—The evolutionary context of secondary structure in NSP13. The phylogenetic trees were built using the multiple sequence alignments for NSP13. 
(A) The multiple sequence alignment is colored as in fig. 3. (B) PSIPRED secondary structure prediction per site in the multiple sequence alignment, color 
coded according to the scale. (C) JPred secondary structure prediction per site in the multiple sequence alignment, color coded according to the scale. Above 
the multiple sequence alignment, the normalized evolutionary rates per site for sequence substitution (SEQ) and SLT based on the binary transformations of 
B-C are shown. Heat maps visualized with the Python packages ETE2 (Huerta-Cepas et al. 2016) and Matplotlib (Hunter 2007). See supplementary figs. $3 
and S4, Supplementary Material online for a complete set of graphics for every protein family. 


in these proteins are needed for the virus life cycle. For CoVs, and 0.5) and with DISOPREDZ2. Even if the different predictors 
nucleocapsid is the only structural protein that is highly dis- and cutoffs disagree somewhat where regions with rapid 
ordered. Yet, rapid evolutionary dynamics of disorder is pre- evolutionary dynamics are present, these patterns suggest 
sent in nucleocapsid using two different IUPred cutoffs (0.4 that nucleocapsid may be rapidly changing from one virus 
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Fic. 5.—Comparison of SEQ at sites characterized by secondary structure. All pairwise rate distributions, except NC(HS) vs. NC(HLS) and NC(HS) vs. C(L), 
are significantly different (P< 0.05, after Bonferroni correction: P < 0.008). For a summary of the U statistic and two-tailed P values for each pairwise 


comparison see supplementary table $2, Supplementary Material online. 


Table 4 
Sites Conserved in Sequence and Structural Property 


Protein family 


NSP5 Peptidase_C30 
NSP7 Replicase 
NSP12 RPol N-term 
RdRP_1 
C-term 
NSP13 N-term 
AAA_30 
AAA_12 
NSP14 NSP11 


PfamA domain 


Conserved sites in the MSA? 


149-GSCGS-153° 
213-AWLYAA-218° 
7-KCTSVVLL-14° 
16-VLQQL-20° 
228-LDNQDLNG-235 
239-DFGDF-243 
521-DKSAG-525 
588-MTNRQ-592 
677-LANECAQVL-685 
800-GGTSSGD-706 
853-YPDPSR-858 
871-KTDGT-875 
889-YPLTK-893 
10-SQTSLR-15 
362-NALPE-366 
402-DPAQLP-407 
539-SSQGS-543 
281-AHVAS-285° 
290-MTRCLA-295° 
438-HAFHT-442° 
494-CNLGG-499° 


“Sites conserved across all clades in the protein family are underlined and in BOLD font. All other sites are conserved across the SARS and MERS clades. 
Experimentally determined structures are available in Protein Data Bank (Berman 2000).. 


to another. It should also be noted that two MERS clade 
specific inserts around position 241 and toward the C-termi- 
nal are consistently predicted to be highly disordered. With 
inserts and changing structural dynamics between clades or 
viruses, the questions become 1) which sequence motif are 
displayed and 2) to what extent are these sequence motifs 
displayed? 

Furthermore, based on the inconsistent prediction of sec- 
ondary structure elements, the possibility that CoVs are more 


contormationally flexible than their intrinsic disorder content 
implies is noteworthy. Altogether, this suggests that various 
mechanisms for rewiring conformational and functional space 
are operating in the coronaviruses studied here. If regions 
symptomatic of conformational and functional flexibility can 
be avoided in order to identify broad-specificity antiviral tar- 
gets with potential to be effective against coronaviruses of 
today and in the future, coronaviruses as a group may 
become more attractive drug targets tor the pharmaceutical 
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Fic. 6.—Target sites shown in 3D context. (A) NSP5 dimer, based on PDB id 1UK4 (Yang et al. 2003). (B) NSP7, based on PDB id 5F22 (unpublished). (C) 
NSP14, based on PDB id 5C8T (Ma et al. 2015). Protein structure visualized with Bioviva Discovery Studio . 


industry in the event an additional coronavirus changes host to 
include humans or increase its virulence. 


Supplementary Material 


Supplementary tables $1 and S2 and figures S1-S5 are avail- 
able at Genome Biology and Evolution online (http:/Avwww. 
gbe. oxfordjournals.org/). 
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